note: generally, this is to ensure that the HA changes for multiple servers have not introduced installation / upgrade regressions for single-server setups
start with existing (upgrade) or new (overwrite) db
use installer to add server: name it server-1, leave default affinity group option selected
test behavior of old agent trying to connect to new server
should fail and report protocol version mismatch error
register a new agent agent-1 to this server
tail the server log and make sure the cache gets loaded for this agent
using the HAAC agent view, verify that agent-1's failover list contains server-1 only
stop the agent
start the agent (without "--clean")
tail the server log and make sure the cache gets reloaded for this agent
kill the server
wait several minutes so the agent spools up data while the server is down
restart the server
tail the server log and make sure the cache gets reloaded for this agent
note: the cache must be loaded BEFORE any agent reports are sent
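optional: the server-log checks above can be scripted; the sketch below simply follows the log and prints matching lines, and both the log file path and the cache-load message pattern are assumptions to adjust to what your server actually logs.
```python
# Minimal log watcher: follows the server log and prints lines that look like
# the agent cache being loaded. The path and pattern below are assumptions.
import re
import time

LOG_FILE = "rhq-server-log4j.log"    # assumed log file -- point at the real server log
AGENT_NAME = "agent-1"
CACHE_PATTERN = re.compile(r"cache", re.IGNORECASE)  # assumed text of the cache-load message

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)            # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

for line in follow(LOG_FILE):
    if CACHE_PATTERN.search(line) and AGENT_NAME in line:
        print(line.rstrip())
```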
put the server into maintenance mode (MM)
tail the agent log to ensure the agent is initiating failover
put the server into NORMAL mode
tail the server log and make sure the cache gets reloaded for this agent
note: the cache must be loaded BEFORE any agent reports are sent
register 5 more agents to the server (for a total of 6)
go to HA admin console, "affinity groups" section
ensure that servers and agents listed all have an empty affinity group value
using the HAAC agent view, verify that all 6 agents have a failover list containing server-1 only
using the HAAC server list view, verify that server-1 shows an agent count of 6 (assuming all agents are running)
log into the application as the superuser
go to HA admin console, "servers" section
verify that the correct number of servers is listed
are they exposed at the proper IPs/ports?
are they in the proper affinity groups?
does it properly show which agents are connected to which servers?
does it correctly show which servers are up/down?
also, repeat "verify HA console affinity group view"
go to HA admin console, "agents" section
ensure affinity group information matches expected values for all agents listed
verify agent resource configurations
go to the configuration > current subtab for each agent and verify that the affinity group configuration property is correct for that agent
verify the most recent configuration update history reflects the correct value for each agent
also, repeat "verify HA console affinity group view"
go to HA admin console, "affinity groups" section
click edit
use UI controls to move servers and/or agents into different affinity groups
click save
verify the results (there will be some redundancy, since the verify server/agent HA tracking data tasks both require verifying the data on the HA admin console, "affinity groups" section)
repeat "verify server HA tracking data"
repeat "verify agent HA tracking data"
log into HA admin console
click re-partition button
use other HA console UI pages to ensure that:
the agents are distributed correctly across all servers
If affinity is in use, the distribution may not be even. Satisfying affinity is weighted more highly than even distribution. See HA Load Balancing for more detail.
the failover list for each agent includes all servers (no server should be listed more than once)
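optional: if you transcribe the console data by hand, a small script can double-check both conditions; everything in the dictionaries below is assumed sample data to replace with the real snapshot.
```python
# Sanity-check a hand-transcribed snapshot of the HA console data.
from collections import Counter

connected = {            # agent name -> server it is currently connected to (assumed sample data)
    "agent-1": "server-1",
    "agent-2": "server-2",
}
failover_lists = {       # agent name -> failover list from the HAAC agent view (assumed sample data)
    "agent-1": ["server-1", "server-2"],
    "agent-2": ["server-2", "server-1"],
}
all_servers = {"server-1", "server-2"}

# distribution: how many agents each server carries (may be uneven when affinity is in use)
print(Counter(connected.values()))

# failover lists: every list contains every server exactly once
for agent, servers in failover_lists.items():
    assert set(servers) == all_servers, f"{agent} is missing {all_servers - set(servers)}"
    assert len(servers) == len(set(servers)), f"{agent} lists a server more than once"
print("failover lists OK")
```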
note: this performs an install using the same installer, but with different options. since you'll be configuring this server against the same existing database, this will be an HA install. it tests a different installer path, and also introduces server-side HA principles that need to be tested.
use installer, and point to the same database / port / dbuser you did earlier
name this instance server-2 (it should run on a different machine or, at minimum, on the same machine with a different port)
verify that the web UI can be reached from any server endpoint
repeat "ensure even agent load distribution"
Installing a server into the cloud will repartition the agents. Note that affinity will still be satisfied, so previous agent affinity should still hold, which may result in an uneven distribution.
Note that affinity assignment changes will also repartition the agents.
take down one of the servers, either via the shutdown operation (graceful) or by killing it
repeat "verify server HA tracking data"
repeat "ensure even agent load distribution"
A server going down, or going into maintenance mode, does not trigger a repartition. Server lists remain the same and agents will fail over using their existing lists.
log into HA admin console, "servers" section
take an enterprise-wide snapshot of which agents are connected to which servers
click "maintenance mode" button next to any server
repeat "ensure AG-aware agent load distribution", but do not click the re-partition button
note: a full redistribution is not performed at this time; only the agents connected to the server that went into maintenance mode should fail over to their secondaries. validate this by taking another enterprise-wide snapshot of which agents are connected to which servers and comparing it against the previous snapshot (see the snapshot-diff sketch after this block)
click the "normal" button for the same server again (ending the maintenance period)
repeat "ensure AG-aware agent load distribution", this time clicking the re-partition button as usual
Cloud member operation mode changes (going up or down, or in and out of maintenance mode) don't really affect the agent distribution algorithm. Server lists will include cloud members that may be temporarily unavailable. So, re-partitioning at this point should not have any impact on distribution.
Cloud size changes do affect the agent distribution. So, installation or deletion of servers will have a major effect on distribution.
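optional: a snapshot-diff sketch for the maintenance-mode check above; the two snapshots and the name of the server placed into maintenance are assumed sample data taken from the console.
```python
# Diff two enterprise-wide snapshots (agent -> connected server) taken before and
# after a server enters maintenance mode. Only agents that were connected to that
# server should have moved. All values below are assumed sample data.
before = {"agent-1": "server-1", "agent-2": "server-2", "agent-3": "server-1"}
after  = {"agent-1": "server-2", "agent-2": "server-2", "agent-3": "server-2"}
maintenance_server = "server-1"

moved = {a for a in before if before[a] != after.get(a)}
expected_to_move = {a for a, s in before.items() if s == maintenance_server}

assert moved == expected_to_move, f"unexpected failovers: moved={moved}, expected={expected_to_move}"
print("only agents from", maintenance_server, "failed over:", sorted(moved))
```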
use the HA admin console > servers section to put one server into maintenance mode
wait a few moments (agents will fail over to the remaining server in NORMAL mode)
use HA admin console > servers section to ensure all agents have connected to a single server
put the server back into NORMAL mode
after some time (should be no more than 1 hour) agents will switch back to their primary server.
if you want to speed up this test, you can reduce the 1-hour setting in the agent configuration (rhq.agent.primary-server-switchover-check-interval-msecs) to, say, 10 minutes.
ensure all agents are connected to their primary server
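optional: rather than re-checking by hand, a polling loop like the one below can watch for the switchover; get_connected_server() is a hypothetical placeholder for however you actually read the current agent-to-server connection (HA admin console, agent prompt, or agent log).
```python
# Poll until every agent is back on its primary server, or a timeout expires.
import time

primaries = {"agent-1": "server-1", "agent-2": "server-2"}   # assumed primary assignments

def get_connected_server(agent_name):
    # hypothetical placeholder -- replace with however you read the current connection
    raise NotImplementedError

deadline = time.time() + 70 * 60   # a bit more than the default 1-hour check interval
while time.time() < deadline:
    current = {a: get_connected_server(a) for a in primaries}
    if current == primaries:
        print("all agents are back on their primary servers")
        break
    time.sleep(60)
else:
    print("timed out waiting for agents to switch back to their primaries")
```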
temporarily block an agent from connecting to a server
e.g., via a firewall rule, a port forwarding change, or by unplugging one of the machines from the network (see the firewall sketch after this block)
if the blocked time is short (~15 secs), verify that the agent does not fail over
use HA admin console to verify that agent failover history is unchanged
if the blocked time is long (~1 min), verify that the agent fails over
use HA admin console to verify that agent failover history has new entries
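optional: one way to script the temporary block is a firewall rule on the agent machine; the sketch below assumes Linux with iptables, root privileges, and that the agent talks to the server on port 7080 (adjust the address, port, and duration to your setup).
```python
# Temporarily block the agent's outbound connection to the server, then restore it.
import subprocess
import time

SERVER_ADDR = "10.0.0.5"   # assumed server address -- use your server's IP
SERVER_PORT = "7080"       # assumed agent->server transport port -- adjust if different
BLOCK_SECONDS = 15         # ~15s for the "no failover" case, ~60s+ for the failover case

rule = ["-d", SERVER_ADDR, "-p", "tcp", "--dport", SERVER_PORT, "-j", "DROP"]
subprocess.run(["iptables", "-I", "OUTPUT"] + rule, check=True)   # add the DROP rule
try:
    time.sleep(BLOCK_SECONDS)
finally:
    subprocess.run(["iptables", "-D", "OUTPUT"] + rule, check=True)  # remove it again
```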
repeat "add server to the cloud", for a total of 3 servers in the cloud
log into HA admin console
click re-partition button
use other HA console UI pages to ensure that:
the agents are evenly load-distributed across all servers (should be 2 agents per server, if no affinity is assigned)
the failover list for each agent includes all servers (no server should be listed more than once)
note: affinity groups (AG) provide a mechanism for agents to prefer to connect to some servers over others.
log into HA admin console
assign 1 server to AG-1 and 2 agents to AG-1 (the rest of the agents/servers won't be in any AG)
make sure the agents you assign to AG-1 are currently NOT connected to the server put in AG-1
click re-partition button, and wait a while
You could wait quite a while for this (as much as a day, since agents do not "pull" a new server list very often). As an alternative you can:
Restart the agents
Use the new agent operation, via the GUI, to force (all of) the agents to update their lists (this is preferred as the agent keeps running).
ensure that the AG-1 agents are now connected to the AG-1 server
ensure the other 4 agents are evenly distributed across the remaining two non-AG servers
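optional: a scripted version of the two affinity checks above; the group assignments and connections below are assumed sample data to transcribe from the HA admin console after the repartition.
```python
# Check that AG-1 agents landed on the AG-1 server and that the remaining agents
# are spread across the non-AG servers. All data below is assumed sample data.
from collections import Counter

server_group = {"server-1": "AG-1", "server-2": None, "server-3": None}
agent_group  = {"agent-1": "AG-1", "agent-2": "AG-1",
                "agent-3": None, "agent-4": None, "agent-5": None, "agent-6": None}
connected    = {"agent-1": "server-1", "agent-2": "server-1",
                "agent-3": "server-2", "agent-4": "server-3",
                "agent-5": "server-2", "agent-6": "server-3"}

for agent, server in connected.items():
    if agent_group[agent] == "AG-1":
        assert server_group[server] == "AG-1", f"{agent} should be on an AG-1 server"
    else:
        assert server_group[server] is None, f"{agent} should be on a non-AG server"

# distribution of the non-AG agents across the non-AG servers
non_ag = Counter(s for a, s in connected.items() if agent_group[a] is None)
print("non-AG distribution:", dict(non_ag))
```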
repeat "add server to the cloud", for a total of 4 servers in the cloud
log into HA admin console
assign 2 more servers to AG-1 (now there are 3 servers in AG-1, and 1 server in no AG)
assign 2 more agents to AG-1 (now there are 4 agents in AG-1, and 2 agents in no AG)
click re-partition button, and wait a while
You could wait quite a while for this (as much as a day, since agents do not "pull" a new server list very often). As an alternative you can:
Restart the agents
Use the new agent operation, via the GUI, to force (all of) the agents to update their lists (this is preferred as the agent keeps running).
ensure that the 4 AG-1 agents are now connected to one of the 3 AG-1 servers
ensure the other 2 agents are connected to the remaining non-AG server
put one of the AG-1 servers into MM
ensure that the 4 AG-1 agents are now connected to one of the 2 remaining AG-1 servers
ensure the other 2 agents are still the only ones connected to the non-AG server